Welcome to “How to Create Small Multiples in R with ggplot2!” In this scenario, we will cover how to arrange numerous figures together in a grid or facet.
This scenario assumes you’ve done some data wrangling with tidyr and dplyr, and data visualization with ggplot2.
It’s best to start a project off with a “view of the forest from outside the trees.” The technical term for this is data lineage, which:
“Includes the data origin, what happens to it, and where it moves over time.”
Having a “bird’s eye view” of the data ensures there weren’t any problems with exporting or importing. Data lineage also means understanding where the data is coming from (e.g., a relational database, API, flat .csv files, etc.).
Knowing some of the technical details behind a dataset lets us frame the questions or problems we’re trying to tackle. In this scenario, we will use tabular data (i.e., spreadsheets). Tabular data organizes information into columns and rows.
Let’s load some data and get started!
The package we’ll use to view the entire dataset with R is skimr. We will install and load the following packages:
install.packages(c("tidyverse", "skimr"))
library(tidyverse)
library(skimr)
Sometimes data is so complex and layered that it requires multiple graphs. In this case, we will want to arrange numerous figures together in a grid or facet. Charts presented this way are called “small multiples,” a term popularized by Edward Tufte:
“Small multiples resemble the frames of a movie: a series of graphics, showing the same combination of variables, indexed by changes in another variable.” Edward Tufte, The Visual Display of Quantitative Information.
We are going to demonstrate faceting with a dataset we’ve created from the Internet Movie Database (IMDB).
IMDB makes multiple datasets available for download. We’ve combined the title.ratings.tsv, name.basics.tsv, and title.principals.tsv datasets into the ImdbData dataset with the following columns:
- tconst = alphanumeric unique identifier of the title (used for joining)
- nconst = alphanumeric unique identifier of the name/person (used for joining)
- category = the category of job that person was in
- primaryName = name by which the person is most often credited
- birthYear = in YYYY format
- averageRating = weighted average of all the individual user ratings
- numVotes = number of votes the title has received
- primaryTitle = the more popular title/the title used by the filmmakers on promotional materials at the point of release
- originalTitle = original title, in the original language
- isAdult = non-adult title or adult title
- startYear = represents the release year of a title. In the case of TV Series, it is the series start year.
- runtimeMinutes = primary runtime of the title, in minutes
- genres = includes up to three genres associated with the title.
- age_lead = age of actor/actress at time of film (age_lead = startYear - birthYear)

# click to execute code
ImdbData <- readr::read_csv("https://bit.ly/2O2ZKDC")
glimpse(ImdbData)
#> Rows: 136,925
#> Columns: 14
#> $ tconst <chr> "tt0000574", "tt0000630", "tt0000886", "tt0001101", "…
#> $ nconst <chr> "nm0846887", "nm0624446", "nm0609814", "nm0923594", "…
#> $ category <chr> "actress", "actress", "actor", "actor", "actor", "act…
#> $ primaryName <chr> "Elizabeth Tait", "Fernanda Negri Pouget", "Jean Moun…
#> $ birthYear <dbl> 1879, 1889, 1841, 1870, 1869, 1844, 1891, 1879, 1867,…
#> $ averageRating <dbl> 6.1, 3.2, 5.0, 5.2, 4.7, 5.8, 5.5, 3.9, 4.6, 5.3, 4.2…
#> $ numVotes <dbl> 609, 11, 23, 13, 10, 22, 47, 11, 13, 28, 14, 13, 25, …
#> $ primaryTitle <chr> "The Story of the Kelly Gang", "Hamlet", "Hamlet, Pri…
#> $ originalTitle <chr> "The Story of the Kelly Gang", "Amleto", "Hamlet", "A…
#> $ isAdult <chr> "non-adult title", "non-adult title", "non-adult titl…
#> $ startYear <dbl> 1906, 1908, 1910, 1910, 1910, 1912, 1910, 1911, 1910,…
#> $ runtimeMinutes <dbl> 70, NA, NA, NA, NA, NA, NA, NA, NA, 50, NA, NA, NA, N…
#> $ genres <chr> "Biography,Crime,Drama", "Drama", "Drama", "\\N", "Cr…
#> $ age_lead <dbl> 27, 19, 69, 40, 41, 68, 19, 32, 43, 28, 52, 39, 23, 4…
In the following, we build a skimr::skim() of the ImdbData dataset:
# click to execute code
SkimImdbData <- skimr::skim(ImdbData)
summary(SkimImdbData)
| Name | ImdbData |
| Number of rows | 136925 |
| Number of columns | 14 |
| _______________________ | |
| Column type frequency: | |
| character | 8 |
| numeric | 6 |
| ________________________ | |
| Group variables | None |
First, we will review the character variables.
# click to execute code
SkimImdbData %>%
skimr::yank("character")
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| tconst | 0 | 1 | 9 | 10 | 0 | 136925 | 0 |
| nconst | 0 | 1 | 9 | 10 | 0 | 41765 | 0 |
| category | 0 | 1 | 5 | 7 | 0 | 2 | 0 |
| primaryName | 0 | 1 | 2 | 38 | 0 | 41658 | 0 |
| primaryTitle | 0 | 1 | 1 | 196 | 0 | 124228 | 0 |
| originalTitle | 0 | 1 | 1 | 196 | 0 | 128171 | 0 |
| isAdult | 0 | 1 | 11 | 15 | 0 | 2 | 0 |
| genres | 0 | 1 | 2 | 31 | 0 | 1056 | 0 |
These all look complete.
The number of individual responses (n_unique) for each character variable is a good source for sanity checks. For example, the largest number of unique values belongs to the title id variable (tconst = 136925), and this is identical to the number of rows in the dataset. The next largest number belongs to the originalTitle variable (128171), and the documentation tells us this variable is the title for the film in its original language. By itself, this number doesn’t tell us much, but we can see the next largest number (124228) is the film’s primaryTitle, and it makes sense that the number of unique responses for these two variables is almost the same.
It also makes sense that the n_unique for the actor/actress id (nconst, 41765) is close to the n_unique for the actor/actress primaryName (41658). There should also be far more titles (originalTitle or primaryTitle) than genres, and there are: genres has only 1056 unique values.
Finally, we can see the two binary variables we read about above (category and isAdult) only list 2 unique values (in n_unique), so it appears we imported these variables correctly.
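The same sanity checks can also be run directly rather than read off the skim table. The sketch below is only an illustration: toy_titles is a small made-up tibble (not the IMDB data), and it uses dplyr::n_distinct() to count unique values per column before comparing the id column against the number of rows.

```r
library(dplyr)

# Hypothetical toy data standing in for ImdbData
toy_titles <- tibble(
  tconst   = c("tt1", "tt2", "tt3", "tt4"),
  nconst   = c("nm1", "nm1", "nm2", "nm3"),
  category = c("actor", "actor", "actress", "actor")
)

# Count unique values in every column (the figure skim() reports as n_unique)
unique_counts <- toy_titles %>%
  summarise(across(everything(), n_distinct))

# A column that identifies each row should have as many unique values as rows
id_is_unique <- unique_counts$tconst == nrow(toy_titles)

unique_counts
id_is_unique
```

Swapping toy_titles for ImdbData should reproduce the n_unique column from the character summary, assuming the data has been loaded as shown earlier.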
Next, we will review the mean, standard deviation (sd), minimum (p0), median (p50), maximum (p100), and hist for the numeric variables in ImdbData:
# click to execute code
SkimImdbData %>%
skimr::focus(numeric.mean, numeric.sd,
numeric.p0, numeric.p50, numeric.p100,
numeric.hist) %>%
skimr::yank("numeric")
Variable type: numeric
| skim_variable | mean | sd | p0 | p50 | p100 | hist |
|---|---|---|---|---|---|---|
| birthYear | 1946.85 | 27.79 | 1839 | 1950.0 | 2015 | ▁▂▆▇▂ |
| averageRating | 6.00 | 1.16 | 1 | 6.1 | 10 | ▁▂▇▆▁ |
| numVotes | 6015.29 | 43514.99 | 10 | 125.0 | 2334927 | ▇▁▁▁▁ |
| startYear | 1985.28 | 26.46 | 1906 | 1990.0 | 2021 | ▁▂▃▅▇ |
| runtimeMinutes | 97.14 | 24.13 | 2 | 94.0 | 1500 | ▇▁▁▁▁ |
| age_lead | 38.43 | 12.95 | 1 | 36.0 | 98 | ▁▇▅▁▁ |
Let’s take a look:
The average birthYear is 1947, which is plausible considering the date range for movies in the IMDB (1906 - 2021).
The average movie rating is a 6.00, which can be a little confusing considering IMDB’s rating scale. Still, we can feel confident the data isn’t skewed because the mean and median (p50) are relatively close to each other.
The number of votes (numVotes) is the most skewed variable: it ranges from 10 to 2334927, and its mean (6015.29) is far above its median (125).
The startYear for the movie has an average of 1985, and increases steadily from 1906 to 2021, making sense because more films are being made every year.
The average length of each movie in ImdbData is 97.1 minutes (runtimeMinutes). But we can also see from the hist that the range for runtimeMinutes includes some very low and high values (p0 = 2 and p100 = 1500).
The actor/actress’s average age is 38.4, with a low of 1 and a high of 98 (both plausible).
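One rough way to put the mean-versus-median comparison side by side is to compute their ratio for each variable. The sketch below copies the means and medians from the skim output above into a small data frame; the mean_to_median column is an illustrative heuristic introduced here, not something skimr reports.

```r
# Means and medians copied from the skim output above
skim_summary <- data.frame(
  variable = c("averageRating", "numVotes", "runtimeMinutes", "age_lead"),
  mean     = c(6.00, 6015.29, 97.14, 38.43),
  p50      = c(6.1, 125, 94, 36)
)

# Rough heuristic: a ratio near 1 hints at a fairly symmetric distribution,
# a ratio far above 1 hints at a long right tail
skim_summary$mean_to_median <- skim_summary$mean / skim_summary$p50

# Sort so the variable with the largest ratio comes first
skim_summary[order(-skim_summary$mean_to_median), ]
```

The ratio is only a quick screen; the hist column and a proper plot remain the better guide to the shape of each distribution.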
We will proceed under the assumption that our stakeholders asked us to help explain the relationship between the average rating a movie received (averageRating) and the number of votes that went into the score (numVotes).
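There are quite a few years in this dataset, so instead of looking at each year, we will group the titles into decades. To do this, we need a categorical variable built from the startYear variable. The cut() function is handy because we can supply the number of breaks we want to split the numeric startYear variable into (12 in this case). Note that a single number of breaks tells cut() to divide the observed range (1906 to 2021) into 12 equal-width intervals of a little under ten years each, so the bins approximate decades rather than matching calendar decades exactly. We will also create some clear labels for this categorical variable with the labels argument and make the resulting factor ordered.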
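To see how cut() behaves with a single number of breaks, here is a small sketch on made-up years (the years vector is hypothetical, not drawn from ImdbData). It also shows one possible alternative, passing explicit break points, for cases where exact calendar decades matter.

```r
# Toy release years (hypothetical, not drawn from ImdbData)
years <- c(1906, 1915, 1919, 1920, 1999, 2000, 2021)

# A single number of breaks: 12 equal-width bins spanning the observed range
equal_width <- cut(years, breaks = 12)
levels(equal_width)  # the bin edges fall on fractional years

# Explicit breaks: calendar decades, closed on the left ([1910, 1920), ...)
decade_starts <- seq(1900, 2020, by = 10)
calendar_decade <- cut(years,
                       breaks = c(decade_starts, 2030),
                       labels = paste0(decade_starts, "s"),
                       right = FALSE,
                       ordered_result = TRUE)

data.frame(years, equal_width, calendar_decade)
```

The equal-width version needs less code, which is presumably why the scenario uses it; the explicit version trades a few extra lines for labels that match the calendar.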
The code below creates the new factor variable and then checks it with fct_count() from the forcats package:
# click to execute code
ImdbData <- ImdbData %>%
mutate(year_cat10 = cut(x = startYear,
breaks = 12,
labels = c("1910s", "1920s", "1930s",
"1940s", "1950s", "1960s",
"1970s", "1980s", "1990s",
"2000s", "2010s", "2020s"),
ordered_result = TRUE))
# check the count of our factor levels
fct_count(f = ImdbData$year_cat10, sort = TRUE)
We want to examine how the numVotes variable changed over time (year_cat10). Let’s review the numVotes variable below with skimr::skim():
ImdbData$numVotes %>% skimr::skim()
| Name | Piped data |
| Number of rows | 136925 |
| Number of columns | 1 |
| _______________________ | |
| Column type frequency: | |
| numeric | 1 |
| ________________________ | |
| Group variables | None |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| data | 0 | 1 | 6015.29 | 43514.99 | 10 | 33 | 125 | 666 | 2334927 | ▇▁▁▁▁ |
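We can see the values for this variable are right-skewed: most are concentrated near the low end (the median is 125), with a long tail of heavily voted titles pulling the mean up to 6015.29.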
The averageRating gives us the weighted average of all the individual user ratings. We will refresh our memory about this variable by taking a quick look at the skimr::skim() output below:
ImdbData$averageRating %>%
skimr::skim()
| Name | Piped data |
| Number of rows | 136925 |
| Number of columns | 1 |
| _______________________ | |
| Column type frequency: | |
| numeric | 1 |
| ________________________ | |
| Group variables | None |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| data | 0 | 1 | 6 | 1.16 | 1 | 5.3 | 6.1 | 6.8 | 10 | ▁▂▇▆▁ |
We want each decade on the x axis and the numVotes for each film on the y.
The distribution for averageRating looks evenly distributed around the mean and median, and the hist looks relatively symmetrical. We will see if this holds for averageRating when we look at it across the levels of year_cat10.
Below is the mean, standard deviation (sd), minimum (p0), median (p50), maximum (p100), and hist for numVotes, grouped by year_cat10:
# click to execute code
ImdbData %>%
group_by(year_cat10) %>%
select(year_cat10, numVotes) %>%
skimr::skim() %>%
skimr::focus(numeric.mean, numeric.sd,
numeric.p0, numeric.p50, numeric.p100,
numeric.hist)
| Name | Piped data |
| Number of rows | 136925 |
| Number of columns | 2 |
| _______________________ | |
| Column type frequency: | |
| numeric | 1 |
| ________________________ | |
| Group variables | year_cat10 |
Variable type: numeric
The rows below are listed in year_cat10 order, from the 1910s to the 2020s.
| skim_variable | mean | sd | p0 | p50 | p100 | hist |
|---|---|---|---|---|---|---|
| numVotes | 223.35 | 1327.77 | 10 | 22 | 22671 | ▇▁▁▁▁ |
| numVotes | 579.97 | 5005.06 | 10 | 27 | 112785 | ▇▁▁▁▁ |
| numVotes | 686.31 | 5486.03 | 10 | 56 | 167259 | ▇▁▁▁▁ |
| numVotes | 981.25 | 11114.29 | 10 | 60 | 520667 | ▇▁▁▁▁ |
| numVotes | 1174.25 | 9046.01 | 10 | 78 | 404337 | ▇▁▁▁▁ |
| numVotes | 1582.93 | 14908.81 | 10 | 68 | 687259 | ▇▁▁▁▁ |
| numVotes | 1635.41 | 20128.38 | 10 | 69 | 1614179 | ▇▁▁▁▁ |
| numVotes | 2717.53 | 30294.50 | 10 | 69 | 1228225 | ▇▁▁▁▁ |
| numVotes | 4471.46 | 33454.25 | 10 | 85 | 1266076 | ▇▁▁▁▁ |
| numVotes | 9814.88 | 65715.80 | 10 | 183 | 2334927 | ▇▁▁▁▁ |
| numVotes | 11634.93 | 60739.55 | 10 | 273 | 2295898 | ▇▁▁▁▁ |
| numVotes | 9034.78 | 49947.97 | 10 | 282 | 1512248 | ▇▁▁▁▁ |
Below is the mean, standard deviation (sd), minimum (p0), median (p50), maximum (p100), and hist for averageRating, grouped by year_cat10:
# click to execute code
ImdbData %>%
group_by(year_cat10) %>%
select(year_cat10, averageRating) %>%
skimr::skim() %>%
skimr::focus(numeric.mean, numeric.sd,
numeric.p0, numeric.p50, numeric.p100,
numeric.hist)
| Name | Piped data |
| Number of rows | 136925 |
| Number of columns | 2 |
| _______________________ | |
| Column type frequency: | |
| numeric | 1 |
| ________________________ | |
| Group variables | year_cat10 |
Variable type: numeric
The rows below are listed in year_cat10 order, from the 1910s to the 2020s.
| skim_variable | mean | sd | p0 | p50 | p100 | hist |
|---|---|---|---|---|---|---|
| averageRating | 5.75 | 1.08 | 2.0 | 5.9 | 8.2 | ▁▁▆▇▂ |
| averageRating | 6.14 | 1.07 | 1.0 | 6.4 | 9.4 | ▁▁▃▇▁ |
| averageRating | 6.18 | 0.91 | 1.7 | 6.3 | 9.0 | ▁▁▅▇▁ |
| averageRating | 6.19 | 0.84 | 1.1 | 6.3 | 8.7 | ▁▁▂▇▁ |
| averageRating | 6.31 | 0.87 | 1.9 | 6.4 | 9.0 | ▁▁▅▇▁ |
| averageRating | 6.23 | 1.00 | 1.8 | 6.3 | 9.2 | ▁▁▆▇▁ |
| averageRating | 6.11 | 1.09 | 1.4 | 6.2 | 9.3 | ▁▁▆▇▁ |
| averageRating | 6.06 | 1.13 | 1.1 | 6.2 | 9.3 | ▁▁▆▇▁ |
| averageRating | 6.00 | 1.17 | 1.0 | 6.1 | 9.5 | ▁▂▇▇▁ |
| averageRating | 5.90 | 1.20 | 1.1 | 6.1 | 9.3 | ▁▂▇▇▁ |
| averageRating | 5.87 | 1.25 | 1.0 | 6.1 | 9.6 | ▁▂▇▇▁ |
| averageRating | 5.83 | 1.28 | 1.0 | 6.0 | 10.0 | ▁▂▇▅▁ |
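A grouped summary like the ones above can also be sketched with plain dplyr, which keeps the grouping label next to each row. The example below uses a made-up tibble (toy_ratings, with hypothetical decade and rating columns) rather than ImdbData, and the column names mirror the skimr statistics only for readability.

```r
library(dplyr)

# Hypothetical toy data: a decade label and a rating per title
toy_ratings <- tibble(
  decade = factor(c("1990s", "1990s", "2000s", "2000s", "2000s"),
                  levels = c("1990s", "2000s"), ordered = TRUE),
  rating = c(6, 7, 5, 6, 7)
)

# One row per decade, with the same statistics shown in the skim tables
decade_summary <- toy_ratings %>%
  group_by(decade) %>%
  summarise(
    n    = n(),
    mean = mean(rating),
    sd   = sd(rating),
    p0   = min(rating),
    p50  = median(rating),
    p100 = max(rating),
    .groups = "drop"
  )

decade_summary
```

With the real data, the same pipeline would group by year_cat10 and summarise numVotes or averageRating instead.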
How do you think the relationship between numVotes and averageRating will look?
We want to make direct comparisons across time by placing an individual plot for each decade within the same view. The best way to accomplish this is by using small multiples, where we repeat the same graph for each snapshot of time and present them in a grid.
The functions for creating small multiples in ggplot2 are facet_wrap() or facet_grid(). We will demonstrate this below using the former, and we also make some adjustments to the x axis to make it easier to read:
Inside the ggplot2::scale_x_continuous() function:
- limits sets the range of the x axis, here from 10 (the minimum) to 2334927 (the maximum)
- breaks specifies how we want to ‘break up’ our x axis. We will use the minimum, 1167464 (0.5 x the maximum, rounded), and the maximum
- labels specifies what text we want to display at each break

The ggplot2::facet_wrap() function uses a ~ followed by the categorical variable (year_cat10), which we’ve also supplied to the group aesthetic in ggplot(aes(group = )).
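Before applying this to the IMDB data, the difference between the two faceting functions can be sketched on mpg, a small dataset that ships with ggplot2. The choice of mpg and of its class, drv, and cyl columns is purely illustrative.

```r
library(ggplot2)

# facet_wrap(): one panel per value of a single variable,
# wrapped into as many rows and columns as needed
p_wrap <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(size = 0.5) +
  facet_wrap(~ class)

# facet_grid(): rows defined by one variable (drv), columns by another (cyl)
p_grid <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(size = 0.5) +
  facet_grid(drv ~ cyl)

p_wrap
p_grid
```

facet_wrap() tends to suit a single variable with many levels, such as decades, which is why it is the one used in this scenario.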
Now we can build our labels and small multiples graph:
# click to execute code
# build labels
labs_avgusr_nmvote_yearcat10 <- labs(
title = "Number of votes vs average individual user ratings over time",
subtitle = "Internet Movie Database (IMDB)",
caption = "https://www.imdb.com",
y = "Average individual user ratings",
x = "Number of votes")
gg_step4_facet_01 <- ImdbData %>%
ggplot(aes(x = numVotes,
y = averageRating,
group = year_cat10)) +
geom_point(size = 0.5,
alpha = 1/10,
show.legend = FALSE) +
# add x scale attributes
scale_x_continuous(limits = c(10, 2334927),
breaks = c(10, 1167464, 2334927),
labels = c('10', '~1.2M', '~2.3M')) +
# add facet
facet_wrap(~ year_cat10) +
labs_avgusr_nmvote_yearcat10
# save
# ggsave(plot = gg_step4_facet_01,
# filename = "gg-step4-facet-01.png",
# device = "png",
# width = 9,
# height = 6,
# units = "in")
gg_step4_facet_01
To view the graph as a file, uncomment the ggsave() call above to write gg-step4-facet-01.png, then open it in the VS Code IDE above the Terminal console.
We can see how small multiples allow for more incisive comparisons. The relationship between votes and average user rating becomes slightly evident as time progresses. A few data points approach the 8.75 line in 1960 and 1970, and there appears to be a slight upward trend for average user rating from 1990-2020.
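Because the ggsave() call in the step above is commented out, here is a minimal, self-contained sketch of saving a faceted plot. It uses the mpg dataset bundled with ggplot2 and writes to a temporary folder; the file name facet-sketch.png is a placeholder chosen for this example.

```r
library(ggplot2)

p <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(size = 0.5) +
  facet_wrap(~ class)

# Write to a temporary folder so the sketch does not touch the project directory
out_file <- file.path(tempdir(), "facet-sketch.png")

ggsave(filename = out_file,
       plot = p,
       device = "png",
       width = 9,
       height = 6,
       units = "in")

# TRUE if the file was written
file.exists(out_file)
```

The width and height match the values in the commented-out calls, which give each of the twelve decade panels a reasonable amount of room.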
We will use small multiples again, but this time we will view a log10 transformed numVotes variable. Transforming an axis can sometimes make the display easier to see, but we also need to find a way to explain this change to our audience.
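A quick numeric sketch can show why the transformation helps. The values below are the numVotes quantiles from the skim output earlier; the raw_share and log_share columns are illustrative helpers introduced here to show where each value would sit along an axis running from the minimum to the maximum.

```r
# numVotes quantiles copied from the skim output: p0, p25, p50, p75, p100
votes <- c(p0 = 10, p25 = 33, p50 = 125, p75 = 666, p100 = 2334927)

# Position along a linear axis: everything except the maximum sits near zero
raw_share <- votes / max(votes)

# Position along a log10 axis: the same values are spread far more evenly
log_votes <- log10(votes)
log_share <- (log_votes - min(log_votes)) / (max(log_votes) - min(log_votes))

round(data.frame(votes, raw_share, log_votes, log_share), 3)
```

On the linear axis three quarters of the titles are squeezed into a sliver at the left edge, while on the log10 axis they occupy roughly the first third of the plot.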
We will need to update our labels, add the scale_x_log10() layer, and use facet_wrap() with year_cat10. We will also use the handy label_log10 function developed by Claus Wilke:
# click to execute code
# build labels
labs_avgusr_lognmvote_yearcat10 <- labs(
title = "*Number of votes vs average individual user ratings over time",
subtitle = "Internet Movie Database (https://www.imdb.com/)",
caption = "*Number of votes has been log10 transformed",
y = "Average individual user ratings",
x = "log10(Number of votes)")
# load label_log10 function
source("https://bit.ly/35Ywt2q")
# build plot
gg_step5_facet_02 <- ImdbData %>%
ggplot(aes(x = numVotes,
y = averageRating,
group = year_cat10)) +
geom_point(size = 0.5,
alpha = 1/10,
show.legend = FALSE) +
# add log10 x axis
scale_x_log10(labels = label_log10) +
# facet wrap on decades
facet_wrap(~ year_cat10) +
# add labels
labs_avgusr_lognmvote_yearcat10
# save
# ggsave(plot = gg_step5_facet_02,
# filename = "gg-step5-facet-02.png",
# device = "png",
# width = 9,
# height = 6,
# units = "in")
gg_step5_facet_02
To view the graph as a file, uncomment the ggsave() call above to write gg-step5-facet-02.png, then open it in the VS Code IDE above the Terminal console.
Now the small multiples show the relationship between the log10 transformed number of votes and average user rating. The majority of the data points appear concentrated around the same value for average user rating (~6.125).
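If sourcing a helper from a remote URL is not an option, power-of-ten axis labels can also be sketched with functions from the scales package. This is an alternative under stated assumptions, not the label_log10 function used above: it is assumed to produce broadly similar labels, and the diamonds dataset (bundled with ggplot2) stands in for ImdbData.

```r
library(ggplot2)
library(scales)

# Labeler built from scales helpers: take log10 of each break,
# then format the result as 10^x
power_of_ten <- trans_format("log10", math_format(10^.x))

p_log <- ggplot(diamonds, aes(x = price, y = carat)) +
  geom_point(size = 0.5, alpha = 1/10) +
  # log10 x axis with power-of-ten labels
  scale_x_log10(labels = power_of_ten) +
  facet_wrap(~ cut)

p_log
```

Whichever labeler is used, the caption or axis title should still tell the audience that the axis has been transformed.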
We can use ggplot2’s geom_smooth() function to draw a smoothed ‘best-fit line’ through the data points in each decade. Modeling is beyond the scope of this scenario, but feel free to read more about building models in this chapter of R for Data Science.
For now, know the following:
- method = 'lm' tells geom_smooth to fit the best straight line through the data points
- size = 0.75 is the size of the line
- color = "firebrick2" will make the line color red
- fullrange = TRUE will draw the line through the entire range of the graph

# click to execute code
# build labels
labs_avgusr_lognmvote_yearcat10_lm <- labs(
title = "*Number of votes vs average individual user ratings over time",
subtitle = "Internet Movie Database (https://www.imdb.com/)",
caption = "*Number of votes has been log10 transformed; lm smooth line",
y = "Average individual user ratings",
x = "log10(Number of votes)")
gg_step5_facet_03 <- ImdbData %>%
ggplot(aes(x = numVotes,
y = averageRating,
group = year_cat10)) +
geom_point(size = 0.5,
alpha = 1/10,
show.legend = FALSE) +
# add smoothed line
geom_smooth(method = 'lm',
size = 0.75,
color = "firebrick2",
fullrange = TRUE) +
# add x axis attributes
scale_x_log10(labels = label_log10) +
# add facets
facet_wrap(~ year_cat10) +
# add labels
labs_avgusr_lognmvote_yearcat10_lm
# save
# ggsave(plot = gg_step5_facet_03,
# filename = "gg-step5-facet-03.png",
# device = "png",
# width = 9,
# height = 6,
# units = "in")
gg_step5_facet_03
To view the graph as a file, uncomment the ggsave() call above to write gg-step5-facet-03.png, then open it in the VS Code IDE above the Terminal console.
We can see the relationship for the log10 transformed number of votes versus average user rating is slightly positive but has become less pronounced over time. The gray band is the 95% confidence interval that geom_smooth() draws by default around the red fitted line (fewer data points = a wider band).
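To make the link between the red line and a linear model more concrete, the sketch below draws a faceted lm smooth on the mpg dataset bundled with ggplot2 and then fits the equivalent model by hand for one panel. The dataset and the drv facet are stand-ins chosen for illustration; se = TRUE and level = 0.95 are geom_smooth()'s defaults, written out explicitly.

```r
library(ggplot2)

p_lm <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(size = 0.5) +
  # a separate straight line, with its confidence band, is fitted in each panel
  geom_smooth(method = "lm",
              formula = y ~ x,
              se = TRUE,
              level = 0.95,
              color = "firebrick2") +
  facet_wrap(~ drv)

# The line in the drv == "4" panel corresponds to a fit like this one
fit_4wd <- lm(hwy ~ displ, data = subset(mpg, drv == "4"))
coef(fit_4wd)

p_lm
```

Setting se = FALSE would drop the gray band, which can be worth doing when panels with few points produce bands wide enough to distract from the line.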
In this scenario we covered how to:
- skim() variables to get summary statistics
- mutate() and cut()
- ggplot2::facet_wrap()
- ggplot2::scale_x_log10() and label_log10()
- ggplot2::ggsave()

We’ve concluded the “How to Create Small Multiples in R with ggplot2” scenario! Thank you for completing this scenario, and be sure to check out the other scenarios on the O’Reilly Learning Platform.